══════════════════════════════
9. DATA DEBLOCKING
══════════════════════════════
══════════════════════════════
9.1 An aid to analysis
══════════════════════════════
It is common practice to group several data records
together into a block, of either fixed or variable length. Before
input-output buffering was built into operating system software,
the use of blocks reduced the frequency of read/write instructions
and sped up programs. The size of a block depended on (and
often matched) the physical record size of the storage medium.
In this topic, we examine several techniques of
separating blocks of data into records. The topic is introduced at
this point because deblocking is often done within the analysis
stage. Deblocking gets rid of byte counts or padding that have
nothing to do with the data being analyzed. Byte surveys are
cleaner when they are restricted to the data proper. The binary
component of a file may disappear completely through deblocking.
Blocks may be of fixed or variable length. The records
within a fixed length block may themselves be of fixed length.
Variable length records can be found in blocks of either kind.
═════════════════════════════════
9.2 Reducing line records
═════════════════════════════════
Line records date back to punch cards. Continuous text
would be entered on a series of cards, with blank padding after the
last complete word that could fit on a given card. Recall the
NEWLINES program, introduced in topic 7.1:
NEWLINES blocked_in unblocked_out bytes_per_line
NEWLINES simply inserted line feeds and carriage returns at fixed
intervals in the data. For continuous text on 80 column punched
cards, this left blank padding at the end of almost every line. In
order to get rid of blanks at the end of lines in any ASCII text,
use the utility F_TRAIL:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage:   f_trail [/4] < ASCII text > revised
         Remove trailing blanks from lines of ASCII text.  The /4
         option is for backward compatibility only; it leaves a
         blank in the fourth column where a line consists of a
         three digit field number only.
input:   Any printable ASCII file.
output:  The same file with trailing blanks removed from each line.
writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
For example, these two commands might be used in sequence:
NEWLINES blocked.txt stage2.txt 80
F_TRAIL < stage2.txt > stage3.txt
The file STAGE2.TXT in this case would be fixed length lines of 80
bytes each, plus line feed and carriage return. STAGE3.TXT would
have variable length lines of text (none greater than 80) and a
line feed and carriage return at the end of each line.
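The two stages can be imitated in a few lines of Python, purely
as an illustration. This is a sketch of the idea and not the MIR
source code; the function names break_into_lines and
strip_trailing_blanks are our own, and the sketch assumes the input
holds nothing but fixed length card images.

```python
def break_into_lines(data, bytes_per_line):
    """Cut blocked data into fixed length lines (the NEWLINES idea)."""
    if bytes_per_line <= 0:
        raise ValueError("bytes_per_line must be positive")
    return [data[i:i + bytes_per_line]
            for i in range(0, len(data), bytes_per_line)]


def strip_trailing_blanks(lines):
    """Drop trailing blanks (0x20 only) from each line (the F_TRAIL idea)."""
    return [line.rstrip(b" ") for line in lines]


if __name__ == "__main__":
    # Two made-up 80 column card images, padded with blanks.
    cards = b"FIRST CARD".ljust(80) + b"SECOND CARD".ljust(80)
    stage2 = break_into_lines(cards, 80)
    stage3 = strip_trailing_blanks(stage2)
    for line in stage3:
        print(line.decode("ascii"))
```

Keeping the two steps separate mirrors the two commands: the
first result still has 80 byte lines, and only the second step makes
the lines variable in length.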
The /4 option in F_TRAIL may be safely ignored. It
pads a three digit field number with a single blank; this single
blank pad is not required in MIR production format records. More
on this in MIR Tutorial TWO.
═════════════════════════════════════════
9.3 Handling fixed length records
═════════════════════════════════════════
In topic 7.3 we showed how to extract a single field
from a fixed length record. Here is a deblocking routine P_FIXED
which places all fields in continuous ASCII text:
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage:   p_fixed control_file fixed_length_input > ASCII_output
         Converts a fixed record length file to ASCII with field
         numbers.  A control file governs field lengths and
         handling of empty data.
input:   [1] A control file as in P_FIXED.CTL (also appears at
             end of source code).
         [2] The fixed length records data
output:  ASCII output with one or more lines per field.  New records
         are signalled by a line containing 000; all other lines
         begin with a three digit field number.  Non-printable
         characters are shown in hex format with leading backslash.
         Additional processing may be needed to bring individual
         fields into production indexing format.
writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Here is the template P_FIXED.CTL:
# Edit a copy of this file to use with P_FIXED.EXE in order
# to break out fixed length records. Each line consists of
# three numbers and zero or more codes; each element is
# separated by one or more blanks. The numbers are:
#       field number
#       start byte  (followed by R if right half of byte only)
#       end byte    (followed by L if left half of byte only)
# A special line must be included with field number 0, begin
# byte 0, and end byte = last byte of record (i.e., record
# length - 1).
#
# Comment lines may be included. Each must start with #
#
# The codes that follow the three numbers are:
#       B       retain field if blank
#       Z       retain field if zeros
#       N       retain field if nulls
#       LB      retain leading blanks in field
#       LZ      retain leading zeros in field
#       TB      retain trailing blanks in field
#
0 0 53
1 0 27 LB TB
2 28 29
3 30 32L N
4 32R 34L
5 35 38
6 39 42
7 43 49
8 50 50
9 51 52
The last ten lines above are samples only. Simply edit
a copy of the template and give it a name of your choice. Then run
the command P_FIXED with appropriate file names:
P_FIXED my.ctl fixedlen.dta > ascii.dta
The output takes this form:
000
001 Text of field one
002 Text of field two
etc.
015 \9a\81
016 more data
etc.
The output contains only ASCII characters. Data that is in non-
printable form is converted to hexadecimal format a character at a
time. Note that \9a is a single byte; three characters are needed
to represent each hexadecimal value. Where a byte within a series
of hexadecimals happens to be printable, it is shown in its
printable form.
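The hexadecimal convention can be sketched as follows. This
is an illustration only, not the P_FIXED source; the names
escape_field and format_line are our own. It assumes that
"printable" means the ASCII range 0x20 to 0x7E, and that a literal
backslash in the data passes through unchanged. How P_FIXED itself
treats a backslash is not stated here.

```python
def escape_field(raw):
    """Render bytes as ASCII; non-printable bytes become \\xx (two hex digits)."""
    out = []
    for b in raw:
        if 0x20 <= b <= 0x7E:
            out.append(chr(b))          # printable: shown as itself
        else:
            out.append("\\%02x" % b)    # one byte becomes three characters
    return "".join(out)


def format_line(field_number, raw):
    """Prefix the escaped text with a three digit field number."""
    return "%03d %s" % (field_number, escape_field(raw))


if __name__ == "__main__":
    print(format_line(15, b"\x9a\x81"))
    print(format_line(16, b"more data"))
```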
More processing may be required on some fields.
Tutorial TWO includes software for that purpose.
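Returning to the control file: the format described in the
template is simple enough to read with a short parser. The sketch
below is an assumption about how such a file could be read, not a
copy of what P_FIXED does. The name parse_control and the dictionary
keys are hypothetical. The half byte markers R and L are only
recorded, not acted upon.

```python
import re

KNOWN_CODES = {"B", "Z", "N", "LB", "LZ", "TB"}
POSITION = re.compile(r"^(\d+)([RL]?)$")   # byte number, optional half marker


def parse_control(text):
    """Return (record_length, fields) from the text of a control file."""
    record_length = None
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                        # blank or comment line
        parts = line.split()
        if len(parts) < 3:
            raise ValueError("need three numbers: %r" % line)
        number = int(parts[0])
        start = POSITION.match(parts[1])
        end = POSITION.match(parts[2])
        # Per the template, R belongs on the start byte and L on the end byte.
        if (not start or not end
                or start.group(2) == "L" or end.group(2) == "R"):
            raise ValueError("bad byte position: %r" % line)
        codes = set(parts[3:])
        if not codes <= KNOWN_CODES:
            raise ValueError("unknown code: %r" % line)
        if number == 0:
            # Special line: end byte is the last byte of the record.
            record_length = int(end.group(1)) + 1
            continue
        fields.append({
            "field": number,
            "start": int(start.group(1)),
            "end": int(end.group(1)),
            "right_half_start": start.group(2) == "R",
            "left_half_end": end.group(2) == "L",
            "codes": codes,
        })
    if record_length is None:
        raise ValueError("missing the special line for field number 0")
    return record_length, fields
```

Fed the sample lines from the template, this reports a record
length of 54 bytes, since the special line gives 53 as the last
byte.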
══════════════════════════════════════════════
9.4 Blocked records with ASCII lengths
══════════════════════════════════════════════
Variable length lines of ASCII text are sometimes
blocked with a four byte ASCII count at the beginning of each new
line. There is no line feed or carriage return at the end of a
line. The program DEBLOC_A may be used to deblock this kind of
data.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage:   debloc_a ASCII_blocked_file > unblocked_version
         Remove blocking, insert line feeds in ASCII blocked file.
input:   ASCII file with four byte inclusive line lengths at the
         beginning of every line, no line feeds at end.
output:  Same data with counts out, line feeds/carriage returns in.
writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Data might look like this (usually with longer
lengths):
0016First field.0009No. 2001401234567890013That's it
Note the inclusive counting. The second field has only five bytes,
but the count adds another four bytes... 0009No. 2. Deblocking
that example would produce:
First field.
No. 2
0123456789
That's it
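The inclusive count makes the deblocking loop very short.
Here is a minimal sketch in Python, offered as an illustration and
not as the DEBLOC_A source. The name deblock_ascii is our own, and
the sketch assumes every count is exactly four ASCII digits and
includes its own four bytes. The sample uses three of the fields
shown above.

```python
def deblock_ascii(data):
    """Split data carrying four byte inclusive ASCII counts into lines."""
    lines = []
    pos = 0
    while pos < len(data):
        header = data[pos:pos + 4]
        if len(header) < 4 or not header.isdigit():
            raise ValueError("bad count at offset %d: %r" % (pos, header))
        count = int(header)             # inclusive: covers the count itself
        if count < 4 or pos + count > len(data):
            raise ValueError("count %d does not fit at offset %d" % (count, pos))
        lines.append(data[pos + 4:pos + count])
        pos += count                    # next count starts right here
    return lines


if __name__ == "__main__":
    blocked = b"0016First field." + b"00140123456789" + b"0013That's it"
    # Counts out, carriage return and line feed in.
    print(b"\r\n".join(deblock_ascii(blocked)).decode("ascii"))
```

Because the count is inclusive, the position simply advances
by the count each time; no separate allowance for the four digits
is needed.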
A byte survey of the blocked file would have heavy
concentrations of digits, especially of the digit zero. The data
itself may contain digits, but in much smaller proportions.
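A quick way to see the effect is to compare the share of
digits before and after deblocking. The helper below is a
hypothetical illustration (digit_share is our own name), not one of
the MIR survey programs.

```python
def digit_share(data):
    """Return the fraction of bytes that are ASCII digits 0 to 9."""
    if not data:
        return 0.0
    digits = sum(1 for b in data if 0x30 <= b <= 0x39)
    return digits / len(data)


if __name__ == "__main__":
    blocked = b"0016First field.0013That's it"
    unblocked = b"First field.That's it"
    print(digit_share(blocked), digit_share(unblocked))
```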
═══════════════════════════════════════════════
9.5 Blocked records with binary lengths
═══════════════════════════════════════════════
Newspaper and book publishers often use a blocking
format which has two levels: a block of several thousand bytes,
and sub-blocks within each block. The blocking values are in
binary, and the counts are inclusive. The order of the binary
bytes may vary. The source code for DEBLOC_B assumes high order
byte, low order byte, then two NULLs to make up the four bytes in
each case. Alter the source code in the "get_data" function if you
come across data with a different sequence.
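A two level layout of this kind can be walked with a pair of
nested loops. The sketch below is our own illustration, not the
DEBLOC_B source. It assumes the byte order described above (high
byte, low byte, two NULLs), and it assumes that each count includes
its own four bytes and that a block count also covers the
sub-blocks inside it. The names read_count, with_count and
deblock_binary are hypothetical.

```python
import struct


def read_count(data, pos):
    """Decode a four byte count: high byte, low byte, then two NULLs."""
    if pos + 4 > len(data):
        raise ValueError("truncated count at offset %d" % pos)
    count, filler = struct.unpack_from(">HH", data, pos)
    if filler != 0:
        raise ValueError("expected two NULLs at offset %d" % (pos + 2))
    if count < 4:
        raise ValueError("count %d too small at offset %d" % (count, pos))
    return count


def with_count(payload):
    """Prefix payload with an inclusive four byte count (for test data)."""
    return struct.pack(">HH", len(payload) + 4, 0) + payload


def deblock_binary(data):
    """Return the sub-block payloads from two level binary blocked data."""
    records = []
    pos = 0
    while pos < len(data):
        block_end = pos + read_count(data, pos)
        if block_end > len(data):
            raise ValueError("block overruns the data at offset %d" % pos)
        pos += 4                        # step past the block count
        while pos < block_end:
            sub_len = read_count(data, pos)
            if pos + sub_len > block_end:
                raise ValueError("sub-block overruns its block at %d" % pos)
            records.append(data[pos + 4:pos + sub_len])
            pos += sub_len
    return records


if __name__ == "__main__":
    block = with_count(with_count(b"alpha") + with_count(b"beta"))
    print(deblock_binary(block + block))
```

For data with a different byte sequence, only read_count would
need to change, much as the text suggests for "get_data".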
The program DEBLOC_B deblocks two level binary blocked
data. It also addresses a problem that arises because the data
often originates on mainframe computers which use the EBCDIC
character set. Using a program like EBC_ASC to convert the whole
file from EBCDIC to ASCII also alters the bytes holding the binary
block and sub-block counts. To recover the correct counts,
DEBLOC_B (with the /e argument) converts the counting bytes back to
EBCDIC before interpreting them.
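The idea of reconverting the counting bytes can be shown with
a pair of translation tables. This is a sketch under loud
assumptions: the table inside EBC_ASC is not reproduced here, so
Python's cp037 codec stands in for it, and the approach only works
when the conversion table is one-to-one. A converter that maps
several EBCDIC values onto the same ASCII byte cannot be undone this
way. The name restore_count is our own.

```python
# Stand-in tables built from Python's cp037 codec, NOT from EBC_ASC.
EBCDIC_TO_ASCII = bytes(range(256)).decode("cp037").encode("latin-1")
# Inverse table: position a holds the EBCDIC byte that converts to a.
ASCII_TO_EBCDIC = bytes(EBCDIC_TO_ASCII.index(b) for b in range(256))


def restore_count(converted_header):
    """Undo the character conversion on a four byte count, then decode it.

    Assumes high order byte first, low order byte second, then two NULLs.
    """
    original = converted_header[:4].translate(ASCII_TO_EBCDIC)
    return (original[0] << 8) | original[1]


if __name__ == "__main__":
    true_header = bytes([0x00, 0xC1, 0x00, 0x00])       # count of 193
    damaged = true_header.translate(EBCDIC_TO_ASCII)    # 0xC1 became "A"
    print(damaged, restore_count(damaged))
```

Read without reconversion, the damaged count in this example
would be taken as 65 (the letter A) rather than 193.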
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Usage:   debloc_b binary_blocked_file [/s][/e] > unblocked_output
         Remove blocking, and (if not suppressed by /s argument)
         insert line feeds.  Argument /e must be used if file was
         originally EBCDIC, in which case the block lengths must
         be converted back to EBCDIC before they are interpreted.
input:   File with four byte binary inclusive block lengths and
         sub-lengths, two bytes in high to low order, then two
         NULLs.
output:  Same data with counts out, line feeds/carriage returns
         in (unless suppressed).
writeup: MIR TUTORIAL ONE, topic 9
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
The /s option is used if there is fixed length data
included in the result. (I would have thought it unlikely, until
I was handed a nine-track tape containing such data.)
Notice the assumption that the data itself is printable
ASCII text. If that is not the case and you are working in DOS,
amend the source code to write to a named binary output file.
An ancestor variation of DEBLOC_B is included with the
source code. It has not been stylized for "copyleft", nor has it
been tested recently. The program is P_MARC.C, intended for
deblocking MARC records. MARC records were common for library
citation databases. A companion ASCII document, MARC_REC.DOC, is
also included with the software. It was reverse engineered from a
customer's data several years ago. Its accuracy is not assured.
≡≡≡≡->> QUESTION:
If you have access to data in MARC record format, could
you either furnish a sample, or (better yet) take a run
at upgrading both the MARC_REC.DOC document and the
P_MARC.C source code?
<<-≡≡≡≡
* * * * *
Apart from the Glossary/Index, this completes MIR
Tutorial ONE. You have tools and learning materials that should
equip you to analyze most kinds of data that are likely to be
indexed for search using normal ASCII search terms... words,
phrases, numeric values, subject categories, etc.